Lecture 3: Review
We covered:
data wrangling and types of variables
metadata
project design
summary statistics
graphing means with standard error bars
pipes and %>% or |> and how to group_by
Our last graph
Lecture 4: How to deal with data wrangling
Introduction to probability distributions
What is a frequency distribution?
What is a probability distribution?
Distributions for variables and for statistics
Estimation
Populations and samples
Parameters and statistics
We are going to use some real sculpin data!
Lecture 4: Frequency distributions
The example data we will use are a combination of data from the Toolik Alaska LTER
We will specifically look at fishes such as Grayling
Lecture 4: Frequency distributions
Data - has been cleaned in terms of lake names and species names
Slimy Sculpin - Toolik Lake
sculpin_df %>%
  filter(lake == "Toolik") %>%
  summarize(
    mean  = mean(total_length_mm, na.rm = TRUE),
    sd    = sd(total_length_mm, na.rm = TRUE),
    se    = sd(total_length_mm, na.rm = TRUE) / sqrt(sum(!is.na(total_length_mm))),
    count = sum(!is.na(total_length_mm)),
    .groups = "drop"
  )
# A tibble: 1 × 4
mean sd se count
<dbl> <dbl> <dbl> <int>
1 51.7 12.0 0.834 208
Note: in the Quarto code we use chunk options to control what we see, for example
#| echo: false
#| message: false
#| warning: false
#| fig-height: 4
#| fig-width: 3
#| paged-print: false
What does each of these mean?
Data - has been cleaned in terms of lake names and species names
Slimy Sculpin - Toolik Lake
# Write your code here to read in data
# Remember to use tidy coding skills and comment the HOOI
#
#
# library(tidyverse)
# library(patchwork)
# sculpin_df <- read_csv("data/sculpin.csv")
# now look at what is there
Let’s try looking at what the summary of the data tells us
# now do the summary statistics please
Lecture 4: Frequency Distributions
What is a frequency distribution?
Display of number of observations in certain intervals
e.g., the number of sculpin per interval in Toolik Lake
as a table like below or histogram
sculpin_df %>%
  filter(lake == "Toolik") %>%
  filter(!is.na(total_length_mm)) %>%
  mutate(length_bin = cut_interval(total_length_mm, length = 2)) %>%
  count(length_bin)
# A tibble: 29 × 2
length_bin n
<fct> <int>
1 [10,12] 1
2 (12,14] 3
3 (18,20] 1
4 (22,24] 1
5 (26,28] 1
6 (28,30] 1
7 (30,32] 2
8 (32,34] 3
9 (34,36] 4
10 (36,38] 3
# ℹ 19 more rows
Let’s try looking at what the summary of the data tells us
# now try different bin widths (change length = 2)
sculpin_df %>%
  filter(lake == "Toolik") %>%
  filter(!is.na(total_length_mm)) %>%
  mutate(length_bin = cut_interval(total_length_mm, length = 2)) %>%
  count(length_bin)
# A tibble: 29 × 2
length_bin n
<fct> <int>
1 [10,12] 1
2 (12,14] 3
3 (18,20] 1
4 (22,24] 1
5 (26,28] 1
6 (28,30] 1
7 (30,32] 2
8 (32,34] 3
9 (34,36] 4
10 (36,38] 3
# ℹ 19 more rows
Lecture 4: Frequency Distributions
The alternative is to use a histogram
the y axis is the count
the x axis is the bin range
bins of 0 - 5, 5 - 10, 10 - 15, or any width you choose
in ggplot the code looks like
dataframe %>%
  ggplot(aes(thing_to_count)) +
  geom_histogram(binwidth = increments_to_work_with)
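As a concrete sketch of the template above (with toy data standing in for the real `sculpin_df`, which the slides read from `data/sculpin.csv`):

```r
library(tidyverse)

# Toy stand-in for sculpin_df; the real data come from data/sculpin.csv
sculpin_df <- tibble(
  lake = "Toolik",
  total_length_mm = c(24, 31, 33, 35, 36, 40, 41, 44, 47, 52, 55, 60)
)

# Histogram of total length with 2 mm bins (matching cut_interval(length = 2))
p <- sculpin_df %>%
  filter(lake == "Toolik", !is.na(total_length_mm)) %>%
  ggplot(aes(total_length_mm)) +
  geom_histogram(binwidth = 2) +
  labs(x = "Total length (mm)", y = "Count")
p
```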
Let’s try stuffing frogs in our pockets
# Write your code here to create funny plot
# Remember to use tidy coding skills and comment the HOOI
Lecture 4: Frequency Distributions
What happens as sample size changes…
Sample size:
Low sample number - 15
High sample number - 70
The frequency distribution takes on a “bell shape”…
Lecture 4: Probability distributions
Can we make assumptions about the distribution of a random variable (e.g., length) in the population?
Probability distribution:
theoretical frequency distribution in population
Lecture 4: Probability distributions
For a continuous random variable: probability density function (PDF)
PDF: a mathematical expression of the probabilities associated with getting certain values of the random variable
Area under curve = 1
i.e., probability of length between 10 and 80 = 1
Lecture 4: Probability distributions
Now we could look at a lot of different ranges of lengths
probability of the length larger than the mean
probability of the length larger than 70 mm
probability of the length between two numbers
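If we treat length as roughly normal with the Toolik sample estimates from the earlier summary (mean 51.7 mm, SD 12.0 mm), the three ranges above can be sketched with `pnorm()`; the 40 - 60 mm interval is an illustrative choice, not from the slides:

```r
mu <- 51.7; sigma <- 12.0          # sample estimates from the Toolik summary

# P(length > mean): exactly 0.5 for a symmetric normal distribution
p_gt_mean <- 1 - pnorm(mu, mean = mu, sd = sigma)

# P(length > 70 mm)
p_gt_70 <- 1 - pnorm(70, mean = mu, sd = sigma)

# P(length between two numbers), here 40 mm and 60 mm (illustrative choice)
p_40_60 <- pnorm(60, mu, sigma) - pnorm(40, mu, sigma)

round(c(p_gt_mean, p_gt_70, p_40_60), 3)
```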
Lecture 4: Probability distributions
We usually need to know the probability distribution of random variables in statistical analyses
Many distributions can be defined; some do a reasonable job, especially with continuous variables
Different distributions exist for continuous and discrete variables (like the faces of a single die)
Lecture 4: Probability distributions
Normal (Gaussian): symmetrical, bell-shaped
Defined in terms of mean and variance (μ, σ²)
The standard normal distribution (SND, or z-distribution) has μ = 0, σ² = 1
\(f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y - \mu)^2}{2\sigma^2}}\)
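A quick check that the formula above is the same density R computes with `dnorm()`:

```r
# Normal PDF written out from the formula on the slide
f <- function(y, mu, sigma2) {
  1 / sqrt(2 * pi * sigma2) * exp(-(y - mu)^2 / (2 * sigma2))
}

f(0, mu = 0, sigma2 = 1)   # standard normal density at 0: 1/sqrt(2*pi)
dnorm(0)                   # same value from R's built-in
```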
Lecture 4: Probability distributions
Lognormal: right-skewed distribution
Logarithm of random variable is normally distributed
Common in biology.
Why would this occur or be common in biology?
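A quick sketch of the definition: log-transforming lognormal draws gives normal values (the parameter choices here are arbitrary):

```r
set.seed(1)
x <- rlnorm(10000, meanlog = 2, sdlog = 0.5)  # lognormal random variable
y <- log(x)                                    # should be ~ Normal(mean 2, sd 0.5)
c(mean(y), sd(y))                              # close to 2 and 0.5
```

One common answer to the question above: quantities built from many multiplicative steps (e.g. growth) tend toward lognormal distributions.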
Lecture 4: Probability distributions
Binomial (multinomial):
probability of events that have two outcomes (heads/tails, dead/alive)
Defined in terms of “successes” out of set number of trials
In large number of trials: approximately normal distribution
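In R, binomial probabilities come from `dbinom()`; a sketch with coin flips:

```r
# P(exactly 5 heads in 10 fair coin flips)
dbinom(5, size = 10, prob = 0.5)          # 252/1024, about 0.246

# With many trials the distribution is approximately normal:
n <- 1000; p <- 0.5
k <- 400:600
probs <- dbinom(k, n, p)                  # bell-shaped around n * p = 500
k[which.max(probs)]                       # mode at n * p
```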
Lecture 4: Probability distributions
Poisson: occurrences of (rare) event in time/space
E.g., the number of
Taraxacum officinale (common dandelion) per quadrat
copepods eaten per minute
cells in a field of view
Measures P(y = a certain integer value)
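A sketch with `dpois()`; the rate `lambda = 2` (mean dandelions per quadrat) is a made-up number for illustration:

```r
lambda <- 2                 # assumed mean count per quadrat (illustrative)
dpois(0, lambda)            # P(no dandelions in a quadrat) = exp(-2), ~0.135
dpois(3, lambda)            # P(exactly 3)
sum(dpois(0:100, lambda))   # probabilities over the integers sum to ~1
```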
Lecture 4: Probability distributions
Also have distributions of test statistics
Test statistics:
summary values calculated from data used to test hypotheses
is your result due to chance?
Different test statistics:
different, well-defined distributions
allows estimation of probabilities associated with results
Examples:
z-distribution, student’s t-distribution, χ2-distribution, F-distribution
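For example, R's built-in distribution functions turn an observed test statistic into a probability; the numbers here (t = 2.1 with 20 df) are arbitrary:

```r
# Two-sided p-value from Student's t-distribution
t_obs <- 2.1; df <- 20
2 * pt(-abs(t_obs), df)     # probability of a result this extreme by chance

# Critical value: 95th percentile of a chi-squared distribution with 3 df
qchisq(0.95, df = 3)        # about 7.81
```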
Lecture 4: Samples and populations
Inferential statistics:
inference from samples to populations
Statistical population:
All possible observations of interest
Normally, populations are too large to census
Populations are defined in time + space
Examples of statistical populations from you research area?
Lecture 4: Samples and populations
A key characteristic of a sample is its
size (n observations; n = sample size)
Characteristics of a population are called parameters
Parameters: Greek letters
Characteristics of samples are statistics, estimates of the parameters
Statistics: Latin letters
Random sampling is crucial for inference:
sample -> population
statistics -> parameters
Lecture 4: Parameters and statistics
Two main kinds of summary statistics: center and spread
Center:
Mean (µ, ȳ): sum of sampled values divided by n
Mode: the most common value in the dataset
Median: middle measurement of the data; equal to the mean for normal distributions
Mean
\(\mu = \frac{\sum\limits_{i=1}^{n} Y_i}{n}\)
Formula for n odd
\(\text{median} = Y_{(n+1)/2}\)
Formula for n even
\(\text{median} = \frac{Y_{n/2} + Y_{(n/2)+1}}{2}\)
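Checking these formulas against R's built-ins, using the fish lengths from the next slide:

```r
y <- c(20, 30, 35, 24, 36)

sum(y) / length(y)               # mean = 29
mean(y)                          # same

sort(y)[(length(y) + 1) / 2]     # n = 5 is odd: middle value = 30
median(y)                        # same
```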
Lecture 4: Parameters and statistics
Spread
Range: difference between the highest and lowest observations
Variance (σ², s²): sum of squared differences of observations from the mean, divided by n-1
E.g., fish lengths = 20, 30, 35, 24, 36 mm
# A tibble: 1 × 1
mean
<dbl>
1 29
\(s^2 = \sum_{i=1}^{n} \frac{(y_i - \bar{y})^2}{n-1}\)
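The variance formula, step by step, agrees with R's `var()`:

```r
y <- c(20, 30, 35, 24, 36)
ybar <- mean(y)                        # 29

sum((y - ybar)^2)                      # sum of squared deviations = 192
sum((y - ybar)^2) / (length(y) - 1)    # 192 / 4 = 48
var(y)                                 # same: 48
```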
Lecture 4: Parameters and statistics
Spread
(20 - 29)^2 + (30 - 29)^2 + (35 - 29)^2 + (24 - 29)^2 + (36 - 29)^2 = 192
192 / (5 - 1) = 48 mm^2. Problem: weird units!
# A tibble: 1 × 2
mean variance
<dbl> <dbl>
1 29 48
Lecture 4: Parameters and statistics
Spread
Standard Deviation(σ, s): square root of variance.
Coefficient of variation: SD as % of mean.
Useful for comparing spread in samples with different means
In example: (6.9/29)*100= 23.8 %
\(s = \sqrt{\sum_{i=1}^{n} \frac{(y_i - \bar{y})^2}{n-1}}\)
\(\text{Coefficient of variation} = \frac{S}{\bar{Y}} \times 100\)
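The same example with `sd()` and the CV formula:

```r
y <- c(20, 30, 35, 24, 36)
s <- sd(y)                 # sqrt(48), about 6.93 -- back to the original units
cv <- s / mean(y) * 100    # about 23.9 %
c(s, cv)
```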
Lecture 4: Estimation
Problem: we don’t know the values of parameters
Goal: estimate parameters from empirical data (samples)
3 general methods of parameter estimation:
Maximum Likelihood Estimation (MLE)
Ordinary Least Squares (OLS)
Resampling techniques
MLE: a general method that estimates parameters by maximizing the likelihood of the observed data given the parameter values.
It finds the parameter values that make the observed data most probable under the assumed statistical model.
OLS: a specific method to estimate the parameters of a linear regression model.
It minimizes the sum of squared differences between observed and predicted values.
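A minimal sketch of OLS with `lm()` on simulated data (the true intercept 3 and slope 2 are made up for the example):

```r
set.seed(42)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 1)   # simulated data with known parameters

fit <- lm(y ~ x)                     # OLS: minimizes sum of squared residuals
coef(fit)                            # estimates close to 3 and 2

# For normally distributed data, the MLE of the mean is the sample mean,
# which is also the OLS estimate from an intercept-only model:
c(mean(y), coef(lm(y ~ 1))[[1]])
```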
Lecture 4: Estimation